5547 Stream Distributions CSV instead of pregenerating #5523

Open

jparcill wants to merge 5 commits into rubyforgood:main from jparcill:jparcill/5547-stream-csv-instead-of-generating-before-send

Conversation

@jparcill (Contributor) commented Mar 27, 2026

Resolves #5477

Description

Stream the Distributions CSV export instead of generating the entire file before the download begins. The download starts immediately and can continue for as long as the client stays connected.
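
For context, the core Rails mechanic behind this is handing the response an Enumerator instead of a finished string. Below is a minimal sketch of that pattern, not the exact code in this PR; distributions and row_for are placeholders for the exporter's real relation and row-building logic.

require "csv"

class DistributionsController < ApplicationController
  def index
    respond_to do |format|
      format.csv do
        headers["Content-Type"] = "text/csv"
        headers["Content-Disposition"] = 'attachment; filename="distributions.csv"'
        # Without a Last-Modified (or ETag) header, Rack::ETag buffers the
        # whole body to digest it, which would defeat the streaming.
        headers["Last-Modified"] = Time.now.httpdate
        # Each chunk yielded here is flushed to the client as it is produced,
        # so the browser's download starts immediately.
        self.response_body = Enumerator.new do |yielder|
          yielder << CSV.generate_line(["Date", "Partner", "Total Items"]) # placeholder header row
          distributions.each do |distribution|
            yielder << CSV.generate_line(row_for(distribution)) # row_for is hypothetical
          end
        end
      end
    end
  end
end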

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • Updated the RSpec tests to consume streamed responses

I also simulated a large download by running the following code in my local environment:

# Repeat the result set 1,000 times to simulate a very large export.
1000.times do
  distributions.each do |distribution|
    yield CSV.generate_line(build_row_data(distribution))
  end
end

This downloaded a 3.5 MB file over more than 30 seconds.
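
For reference, a service-level spec that consumes a streamed export might look like the sketch below, assuming a hypothetical DistributionsCsvService whose generate_csv returns an Enumerator of CSV lines; the point is that the spec realizes the stream with to_a before asserting on it.

RSpec.describe DistributionsCsvService do
  let(:organization) { create(:organization) }                                      # assumes FactoryBot
  let(:distributions) { create_list(:distribution, 3, organization: organization) }

  it "yields a header row followed by one row per distribution" do
    service = described_class.new(distributions: distributions, organization: organization)
    lines = service.generate_csv.to_a # realize the stream so we can assert on it
    expect(lines.size).to eq(distributions.size + 1)
    expect(lines.first).to include("Date") # hypothetical header column
  end
end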

include ItemsHelper

def initialize(distributions:, organization:, filters: [])
# Currently, the @distributions are already loaded by the controllers that are delegating exporting
@jparcill (Contributor Author) commented:

I'm not sure this is actually accurate. I originally used find_each when trying to export in my local environment and the code worked, indicating that the distributions were not in memory yet.

I've since switched to using each so that the spec file passes, since the tests expect an array.
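
For anyone following along, the difference between the two calls comes down to this (a sketch; organization.distributions stands in for whatever relation the exporter receives, and process is a placeholder):

# each runs one query and materializes every matching record up front:
organization.distributions.each { |d| process(d) }          # whole result set in memory

# find_each pages through the table in batches (1,000 rows by default),
# so only one batch is in memory at a time:
organization.distributions.find_each(batch_size: 1_000) { |d| process(d) }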

@jparcill changed the title from "Jparcill/5547 Stream Distributions CSV instead of pregenerating" to "5547 Stream Distributions CSV instead of pregenerating" on Mar 30, 2026
@dorner (Collaborator) commented Apr 12, 2026

So I don't think this really solves the problem: while it won't technically time out, the user will be stuck waiting for a long time with no idea whether it's actually working.

I'd suggest we either:

  1. Figure out why the export is so slow and speed it up (ideal), or
  2. Turn this into a delayed process and send an e-mail to the user once it's done (a rough sketch follows).
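
A rough sketch of what option 2 could look like, assuming a hypothetical DistributionsExportJob and ExportMailer (neither exists in this PR):

class DistributionsExportJob < ApplicationJob
  queue_as :default

  def perform(organization_id, user_id)
    organization = Organization.find(organization_id)
    csv = DistributionsCsvService
      .new(distributions: organization.distributions, organization: organization)
      .generate_csv.to_a.join # hypothetical service, as in the sketches above
    # For very large exports, upload the file somewhere durable and e-mail a
    # link rather than attaching it (see the size-limit concern below).
    ExportMailer.distributions_ready(user_id, csv).deliver_later
  end
end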

@jparcill (Contributor Author) commented:

I think a big part of why this is slow is that the current implementation loads all the rows into memory before converting them to a CSV.

I can batch the SQL requests if we want to avoid that. In this PR's current state, I think 3 seconds for 30k rows is not a long wait, in contrast to the current behavior where 30k rows would simply time out.

With the email approach, I think we'd run into email size limits for larger CSVs; we'd need some S3-style solution there.
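
If batching were reinstated, the streaming generator could combine find_each with the Enumerator, keeping memory flat while still sending rows as they are produced. A sketch, where build_row_data and @distributions are the exporter's own members and headers is hypothetical:

def generate_csv
  Enumerator.new do |yielder|
    yielder << CSV.generate_line(headers)
    # Fetch 1,000 rows per query instead of loading the whole table; each
    # row still streams out as soon as it is built.
    @distributions.find_each(batch_size: 1_000) do |distribution|
      yielder << CSV.generate_line(build_row_data(distribution))
    end
  end
end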



Development

Successfully merging this pull request may close these issues.

[BUG]: Exporting a year of distributions times out
